Computer analysis of genome sequences is
currently one of the essential steps for 
obtaining functional and structural infor- 
mation about the respective gene products. 
Database searches are used to transfer 
functional features from annotated proteins 
to the query sequences. With the increasing 
amount of data, more and more software 
robots perform this task (Ref. 1). While robots are
the only solution to cope with the flood of 
data, they are also dangerous because they 
can currently introduce and propagate misannotations
(Refs 2, 3). On the one hand, functional
information is often only partially transferred 
(underprediction). For example, information 
is not usually extracted for each functional 
unit (protein domain) but just taken from 
the one-line description of the best data- 
base match (so multifunctionality is rarely 
considered). On the other hand, overpre- 
dictions are common because the highest- 
scoring database protein does not necessarily 
share the same or even similar functions. 

Definition and collection of 
uncharacterized protein families 

To avoid unnecessary propagation of 
poor annotation, we have collected putative,
poorly annotated proteins that are usually
labeled as 'hypothetical' or just as 'ORF'
(open reading frame). We operationally
defined uncharacterized protein families 
(UPFs) to be families of proteins that: (1) 
contain members in at least three taxonomi- 
cally distinct (and phylogenetically 'distant') 
species; and (2) do not contain (to the best 
of our knowledge) biochemically charac- 
terized proteins. 
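
The two operational criteria can be written down as a small check. This is only a sketch under stated assumptions: the `Member` record, its field names and the `min_species` parameter are hypothetical, and counting distinct species names is a simplification that does not test whether the species are phylogenetically 'distant'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical record for one protein in a candidate family."""
    accession: str
    species: str
    biochemically_characterized: bool = False


def is_upf_candidate(members, min_species=3):
    """Apply the two operational UPF criteria to a list of Member records.

    (1) members occur in at least `min_species` distinct species;
    (2) no member is biochemically characterized.
    Phylogenetic distance between the species is NOT checked here.
    """
    species = {m.species for m in members}
    if len(species) < min_species:
        return False
    return not any(m.biochemically_characterized for m in members)
```

A family with members from three species and no characterized protein passes; it fails as soon as a single member is flagged as characterized, which mirrors the rule that such a family ceases to be a UPF.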

A collection and classification of these
proteins should allow: (a) utilization of family
information and thus a more detailed
characterization; (b) simplification of update
procedures for the entire families if functional
information becomes available for at least
one member; and (c) a careful transfer
of functional features that avoids the pitfalls
described above.

As the numerous genome sequencing 
projects progress, more and more of these 
UPFs emerge in sequence databases. We 
gave high priority to families that contain 
members in at least two of the three major 
kingdoms (archaea, eubacteria, eukaryotes).
The original 'family' definition was based 
on significant hits in the statistics provided by 
FASTA (Ref. 4) or gapped BLAST (Ref. 5). 

Annotation of UPFs in SWISS-PROT 
and PROSITE databases 

A serial number has been assigned to 
each UPF and to each of the corresponding 
SWISS-PROT (Ref. 6) entries. A SWISS-PROT 
document file lists all the current UPFs and
their members in SWISS-PROT. This docu- 
ment is available on the WWW (Ref. 7). In 
the majority of cases, PROSITE entries (Ref. 8) have
already been created to document the 
respective family. Whenever a member of a 
UPF family is biochemically characterized, 
that family ceases to be considered as a UPF 
and is deleted from the list. However, infor- 
mation is provided that allows its history to 
be traced. For example: 

Family: UPF0002 [DELETED]
Taxonomic range: Eubacteria
Comments: Now characterized as a family of
  pseudouridylate synthases (EC 4.2.1.70).
Prototype: RSUA_ECOLI (Accession No. P33918)
PROSITE entry: PDOC00885

Function prediction for the UPFs 

The annotation is handled rather conservatively
(see below) because functional overpredictions
are most dangerous given the many
opportunities for error propagation in
sequence databases (Refs 2, 3). Nevertheless,
we intended to retrieve as many functional
features as possible for each UPF
using comparative analysis. Thus, each
UPF was subjected to a variety of sequence
analysis methods (Ref. 9). In brief, several
members of each UPF were compared with a
database of non-identical protein sequences,
updated daily at the EMBL, using PSI-BLAST
(Ref. 5) with a conservative expected ratio
of false positives (E = 0.001) as a threshold
for each iteration. Sequences were
preprocessed by filtering for transmembrane
(Ref. 10) and coiled-coil regions (Ref. 11).
A multiple alignment was constructed for
each UPF using ClustalX (Ref. 12). If
PSI-BLAST did not identify a relationship to
characterized proteins, other iterative
methods such as Wisetools (Ref. 13) and
MoST (Ref. 14) were applied. They also use
family information, that is, they give more
weight to conserved positions and so on, but
have the advantage that the underlying
multiple alignments can be checked and
improved manually (at the cost of speed and
ease of use).

Finally, all searches were repeated using
a sequence database that only contained
sequences from entirely sequenced genomes
to reduce noise effects (Refs 9, 15). For
example, PSI-BLAST E-values depend on the
database, and a database match might be
significant using a small database but become
insignificant if more background noise
(unrelated or redundant sequences) is added.
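
The database-size effect can be made concrete with the simplified Karlin-Altschul relation used for BLAST statistics, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below is illustrative only: the function name, the sequence lengths and the bit score are assumptions, and the edge-effect and composition corrections applied by real programs are ignored.

```python
def expected_hits(bit_score, query_len, db_len):
    """Simplified E-value: E = m * n * 2**(-S') for bit score S'.

    query_len (m) and db_len (n) are counted in residues.
    Real search programs apply further corrections that are omitted here.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


THRESHOLD = 0.001  # per-iteration E-value cutoff quoted in the text

# Hypothetical numbers: the same alignment score against two database sizes.
small_db = expected_hits(bit_score=50, query_len=300, db_len=10**7)
large_db = expected_hits(bit_score=50, query_len=300, db_len=10**10)

print(f"small database: E = {small_db:.2e}, significant = {small_db < THRESHOLD}")
print(f"large database: E = {large_db:.2e}, significant = {large_db < THRESHOLD}")
```

With these assumed values the identical score passes the 0.001 threshold in the small database and fails it in the large one, because E grows linearly with the number of residues searched.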

In many cases, the iterations revealed
the relationship of the UPFs with other
proteins, families or superfamilies. As the main
focus here was to assign functional features,
the iterations were not continued once
a reasonable prediction could be made.
Criteria for the latter were matches to known
active site patterns or conserved motifs
resembling those in PROSITE, as well as the
positioning of UPF members within
phylogenetic trees. Transmembrane regions were
identified in 13 (22%) of the 58 UPFs,
although functional predictions for these
13 have not been made. Of the remaining
45 UPFs, 25 could be related to proteins
with annotated functional features (Table 1).

Pitfalls in function assignments 

The predictions required careful inspection
of the functional annotations of the
matched database proteins. To illustrate the
difficulties, Table 2 shows the result of a
BLAST search for UPF0002 that includes quite
a few proteins with annotations (in addition
to the first hits that are labeled as
'hypothetical'). Only one can give a clue about
functional features; others are simply wrong,
misleading or uninformative.

Another typical assignment error is 
caused by the sequence similarity of the 
query to a region that is independent from 
the one that was the basis for the annotation. 
For example, the hypothetical protein HI0722 
(Accession No. P44842, ID: YIGZ_HAEIN), 
a member of the UPF0029 family, shows 
significant similarity to two proteins (GenBank
entries gi|2314657 and gi|2688341) in
Helicobacter pylori and Borrelia burgdorferi,
respectively, which are wrongly annotated
as proline dipeptidases (pepQ). The anno- 
tation is based on the N-terminal homology of 
these two proteins with the C-terminal re- 
gion of proline dipeptidase (pepQ) (gil (2358 ) 
of E. coli, which does not harbor the catalytic 
activity of this enzyme. 
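
One way to guard against this kind of error is to ask whether the aligned part of the database protein actually covers the region that its annotation is based on. The following sketch assumes 1-based inclusive coordinates on the database protein; the function names, the 0.8 coverage cutoff and the coordinates in the usage note are hypothetical.

```python
def overlap_fraction(aligned, functional):
    """Fraction of the functional region covered by the aligned region.

    Both arguments are (start, end) tuples, 1-based and inclusive,
    given on the annotated database protein.
    """
    a_start, a_end = aligned
    f_start, f_end = functional
    overlap = min(a_end, f_end) - max(a_start, f_start) + 1
    return max(0, overlap) / (f_end - f_start + 1)


def may_transfer(aligned, functional, min_cover=0.8):
    """Allow annotation transfer only if the match covers the functional region."""
    return overlap_fraction(aligned, functional) >= min_cover
```

For example, a match confined to residues 300-440 of a protein whose catalytic region spans residues 1-250 gives `overlap_fraction((300, 440), (1, 250)) == 0.0`, so the catalytic annotation would not be transferred.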

There were even examples in which the
homologs that scored best in PSI-BLAST
(Ref. 5) did not have the same catalytic
activity because active site residues of the
characterized family were not conserved.
However, there were significantly lower
scoring homologs with perfect matches of their
(distinct) catalytic site residues to the query.
For example, the UPF0046 family has clear
amino acid similarity to proteases that are
easily found by PSI-BLAST (Ref. 5) in the
fourth iteration; yet, residues involved in
metal-binding are only shared with a purple
acid phosphatase family that is only picked
up in the ninth iteration. The E-value for
the phosphatases (1e-5) remains considerably
higher than that for the proteases (5e-78)
in subsequent iterations. Such instances have
implications for current function prediction
programs in which the function of the best
hit is transferred. Clearly, another
generation of methods is required that includes
checks for the presence of functionally
important residues.
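
A residue check of this kind could look roughly as follows. It is a minimal sketch, not an existing tool: it assumes a pairwise alignment given as two gapped strings of equal length, site positions numbered on the ungapped reference sequence, and strict identity as the conservation criterion (conservative substitutions are treated as mismatches).

```python
def conserved_sites(ref_aln, query_aln, site_positions):
    """Report, per reference site, whether the query has the identical residue.

    ref_aln, query_aln: gapped sequences of equal length ('-' marks a gap).
    site_positions: 1-based positions in the UNGAPPED reference sequence.
    Returns a dict {position: True/False} for every site found in the alignment.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(site_positions)
    status = {}
    ref_pos = 0
    for ref_res, query_res in zip(ref_aln, query_aln):
        if ref_res == "-":
            continue  # insertion in the query; no reference position consumed
        ref_pos += 1
        if ref_pos in wanted:
            status[ref_pos] = (query_res == ref_res)
    return status


def supports_transfer(ref_aln, query_aln, site_positions):
    """True only if every listed site is aligned and identical in the query."""
    status = conserved_sites(ref_aln, query_aln, site_positions)
    return len(status) == len(set(site_positions)) and all(status.values())
```

Applied to each candidate hit, such a filter would let a lower-scoring homolog with intact catalytic residues take precedence over a better-scoring hit whose active site is not conserved.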

Use of phylogenetic trees 

As most of the database proteins with
functional annotations were only distantly
related to members of the UPFs, transfer of
functional information is extremely difficult
and arbitrary. The majority of UPFs turned
out to be related to enzymes, and based on
the conservation of the active site residues
the catalytic mechanism remains the same.
This, however, is of little predictive value
because some families, e.g. those with the
hydrolase fold collected in SCOP (Ref. 16)
[...], harbor numerous distinct catalytic
activities such as lipases, esterases,
dehalogenases, peptidases, peroxidases and
lyases. We have therefore constructed
phylogenetic trees of selected members of the
UPFs and of related, but distinct families
that have been identified during the analysis
(Fig. 1). On some occasions, the UPF members
clearly clustered with proteins that all
performed the same function (Fig. 1a), but in
most of the cases the UPFs were of equal
distance to distinct enzymatic activities
(Fig. 1b), thus not allowing any detailed
predictions.

Although the studied protein families
were bound to be difficult for function
predictions because a considerable number
of teams were unable to find functional
features therein, it is noteworthy that there
was not a single case in which we were
able to predict the precise mechanism and
the substrate specificity. Nevertheless, the
information about an enzymatic activity and
the likely reaction mechanisms of the 25
UPFs should prove useful for the analysis of
upcoming genome sequences.

Table 1. Predicted functional features for 25 UPFs

a The numbers of family members are approximate because of daily changes in
databases and loose family definitions.
b E. coli member also predicted by Koonin et al. (UPF0007: nucleotidyltransferase).
Abbreviation: UPFs, uncharacterized protein families.

Table 2. Misleading annotations: PSI-BLAST results for the UPF0002 family (first iteration)

Ranking    Annotation    Probability    Commentary

gnl|PID|e332795 (Z98268) hypothetical protein MTCI125.33 [Mycobacterium tuberculosis]...
sp|P33643|SFHB_ECOLI SFHB PROTEIN
gnl|PID|e1185138 (Z99112) alternative gene name: ylmL; similar to hypothetical proteins [Bacillus subtilis]...
sp|Q12362|RIB2_YEAST DRAP DEAMINASE >gi|1078332|pir||S50972 RIB2 protein - yeast (Saccharomyces cerevisiae) >gi|64222l (X2I618) DRAP deaminase [Saccharomyces cerevisiae] >gi I I 1 1988- I gnl|PID|e2522~9 (X - 1 808) ORF YOL066c [Saccharomyces cerevisiae]...
sp|P33918|RSUA_ECOLI 16S PSEUDOURIDYLATE 516 SYNTHASE (16S PSEUDOURIDINE 516 SYNTHASE) (URACIL HYDROLYASE)
sp|Q47417|YQCB_ERWCA EXOENZYME REGULATION REGULON ORF1 >gi|628643|pir||S45107 hypothetical protein 1 - Erwinia carotovora >gi|496598 (X79474) ORF1 [Erwinia carotovora]...

SFHB is a gene name (suppressor of the temperature-sensitivity of ftsH1 mutation) and does not give much functional insight
function prediction based on this pro
Misleading annotation, operon architecture is not conserved between species

Annotations that hamper functional predictions, illustrated by the example of the UPF0002 family. Based on the recent experimental
characterization of pseudouridylate synthase, this family has been deleted from the UPF list (see text). Nevertheless, the
various, partly contradictory annotations (bold) are extremely difficult to parse for automatic function prediction programs.
For brevity, the PSI-BLAST results have been cut (...).

Figure 1. (a) Phylogenetic tree of selected members of UPF0007 that indicates a likely
function: UPF0007 members group with cytidyltransferase activities (red), whereas related
uridylyltransferases (blue) are more divergent (*pir database entry, pir|G64156; **pir
database entry, pir|S49238). (b) No clear enzymatic activity can be predicted for UPF0017
members: they clearly have the hydrolase fold but have equal distance to peroxidases
(red), esterases (green), peptidases (blue) and other hydrolases (pink) (***GenBank entry
gi|1001804). The trees were calculated using CLUSTALX (Ref. 12).
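
The distinction drawn in the tree analysis, between a UPF member that clusters with proteins of a single function and one that is equally distant from several activities, can be approximated numerically. The sketch below is a swapped-in heuristic rather than the procedure used for Figure 1: it works on tree distances per annotated neighbor instead of tree topology, and the function name, the `margin` parameter and the example distances are all hypothetical.

```python
from collections import defaultdict


def predict_from_distances(distances, margin=0.8):
    """Suggest a function only if one activity is clearly closest.

    distances: iterable of (function_label, tree_distance) pairs measured
    from the UPF member to annotated proteins.
    Returns the label whose mean distance is at most `margin` times the
    mean distance of the runner-up, or None if no activity stands out.
    """
    by_function = defaultdict(list)
    for label, dist in distances:
        by_function[label].append(dist)
    means = sorted((sum(v) / len(v), label) for label, v in by_function.items())
    if not means:
        return None
    if len(means) == 1:
        return means[0][1]
    (best, best_label), (second, _) = means[0], means[1]
    return best_label if best <= margin * second else None
```

With assumed distances of 0.4 and 0.5 to two cytidyltransferases and 1.2 to a uridylyltransferase the function returns "cytidyltransferase"; with near-equal distances to peroxidases, esterases and peptidases it returns `None`, corresponding to the case in which no detailed prediction is possible.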

Annotation with the right level of 
precision helps in future projects 

We were able to provide
some functional annotation for more than
700 of about 1300 proteins clustered in 25 
of the 58 distinct UPFs. Most of them are 
currently named 'hypothetical protein' so 
that their annotation adds enormous value 
to these sequences. For another 13 UPFs 
currently containing about 250 proteins, 
the presence of transmembrane regions 
was recorded. This annotation is now being 
incorporated into PROSITE and SWISS-PROT 
so that these features can be assigned to 
newly sequenced genes as well. 

The difficulties we faced in assigning 
functions by sequence similarity also indi- 
cate that many of the automatic predictions 
by most of the software robots are probably
erroneous. Because of the current policies of
most of the sequence databases, correction 
of annotations is very hard to realize. Thus, 
there should be a combined effort by the 
database teams, the authors of the current 
entries, and the community, to work towards 
a careful functional annotation of all the 
sequences that become publicly available. 